Supervised or unsupervised & model types
Peer Herholz (he/him)
Postdoctoral researcher - NeuroDataScience lab at MNI/McGill, UNIQUE
Member - BIDS, ReproNim, Brainhack, Neuromod, OHBM SEA-SIG
@peerherholz
Aim(s) of this section
learn about the distinction between supervised & unsupervised machine learning
get to know the variety of potential models within each
Outline for this section
supervised vs. unsupervised learning
supervised learning examples
unsupervised learning examples
A brief recap & first overview
let’s bring back our rough analysis outline that we introduced in the previous section

so far we talked about how a Model (M) can be utilized to obtain information (output) from a certain input
the information requested can be manifold but roughly be situated on two broad levels:
learning problem
supervised or unsupervised
specific task type
predicting clinical measures, behavior, demographics, other properties
segmentation
discover hidden structures
etc.
https://scikit-learn.org/stable/_static/ml_map.png
Learning problems - supervised vs. unsupervised

if we now also include task type we can basically describe things via a 2 x 2 design:

Our example dataset
Now that we’ve gone through a huge set of definitions and road maps, let’s move away from these rather abstract discussions to the “real deal”, i.e. seeing how these models behave in the wild. For this we’re going to sing the song “Hello example dataset my old friend, I came to apply machine learning to you again.”. Just to be sure: we will again use the example dataset we briefly explored in the previous section to showcase how the models we just talked about can be put into action, as well as how they change/affect the questions we can address and how we have to interpret the results.
At first, we’re going to load our input data, i.e. X again:
import numpy as np
data = np.load('MAIN2019_BASC064_subsamp_features.npz')['a']
data.shape
(155, 2016)
just as a reminder: what we have in X here is a vectorized connectivity matrix containing 2016 features, which constitute the correlations between brain region-specific time courses for each of 155 samples (participants)
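As a quick sanity check on those numbers: 2016 is exactly the number of unique region pairs in a 64-region parcellation (BASC064), i.e. the upper triangle of a symmetric 64 x 64 correlation matrix. A minimal sketch with random stand-in time courses (the data here are made up for illustration):

```python
import numpy as np

n_regions = 64                                  # BASC064 parcellation
n_pairs = n_regions * (n_regions - 1) // 2      # unique region pairs
print(n_pairs)                                  # 2016

# vectorizing a symmetric correlation matrix (random stand-in time courses)
timeseries = np.random.default_rng(0).normal(size=(n_regions, 100))
corr = np.corrcoef(timeseries)
vec = corr[np.triu_indices(n_regions, k=1)]     # upper triangle, diagonal excluded
print(vec.shape)                                # (2016,)
```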
as before, we can visualize our X to inspect it and maybe get a first idea if there might be something going on
import plotly.express as px
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, plot
fig = px.imshow(data, labels=dict(x="features", y="participants"), height=800, aspect='auto')
fig.update(layout_coloraxis_showscale=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename='input_data.html')
display(HTML('input_data.html'))
at this point we already need to decide on our learning problem:
do we want to utilize the information we already have (labels) and thus conduct a supervised learning analysis to predict Y
or do we not want to utilize that information and thus conduct an unsupervised learning analysis to e.g. find clusters or decompose our data
please note: we only do this for the sake of this workshop! Please never take this kind of “Hm, maybe we do this or that, let’s see how it goes.” approach in your research. Always make sure you have a precise analysis plan that is informed by prior research and guided by the possibilities of your data. Otherwise you’ll just add to the ongoing reproducibility and credibility crisis, hindering rather than accelerating scientific progress. (The other option is that you conduct exploratory analyses and are honest about it, not presenting them as if they were confirmatory.)
that being said: we’re going to basically test all of them (talk about not practising what one preaches, eh?), again, solely for teaching purposes
we’re going to start with supervised learning, thus using the information we already have
Supervised learning
independent of the precise task type we want to run, we initially need to load the information, i.e. labels, available to us:
import pandas as pd
information = pd.read_csv('participants.csv')
information.head(n=5)
| | participant_id | Age | AgeGroup | Child_Adult | Gender | Handedness |
|---|---|---|---|---|---|---|
| 0 | sub-pixar123 | 27.06 | Adult | adult | F | R |
| 1 | sub-pixar124 | 33.44 | Adult | adult | M | R |
| 2 | sub-pixar125 | 31.00 | Adult | adult | M | R |
| 3 | sub-pixar126 | 19.00 | Adult | adult | F | R |
| 4 | sub-pixar127 | 23.00 | Adult | adult | F | R |
as you can see, we have multiple variables, i.e. labels, describing our participants, i.e. samples, and almost each of them could be used to address a supervised learning problem (e.g. Child_Adult)
Supervised learning
goal: Learn parameters (or weights) of a model (M) that maps X to y
however, while some are categorical and thus could be employed within a classification analysis, others are continuous and thus would fit within a regression analysis (e.g. Age)
we’re going to check both
Supervised learning - classification
goal: Learn parameters (or weights) of a model (M) that maps X to y
in order to run a classification analysis, we need to obtain the correct categorical labels defining them as our Y:
Y_cat = information['Child_Adult']
Y_cat.describe()
count 155
unique 2
top child
freq 122
Name: Child_Adult, dtype: object
we can see that we have two unique labels, but let’s plot the distribution just to be sure and maybe see something important/interesting:
fig = px.histogram(Y_cat, marginal='box', template='plotly_white')
fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
that looked about right and we can continue with our analysis
to keep things easy, we will use the same pipeline we employed in the previous section, that is we will scale our input data, train a Support Vector Machine and test its predictive performance:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
pipe = make_pipeline(
    StandardScaler(),
    SVC()
)
A bit of information about Support Vector Machines:
non-probabilistic binary classifier
Pros
effective in high dimensional spaces
still effective in cases where the number of dimensions is greater than the number of samples
uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
versatile: different Kernel functions
Cons
if the number of features is much greater than the number of samples: danger of over-fitting
make sure to check kernel and regularization
SVMs do not directly provide probability estimates
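Two of these points can be seen directly on a fitted SVC. A minimal, self-contained sketch on a synthetic toy problem (the data and numbers here stand in for our real X/Y_cat): the fitted model exposes its support vectors, and decision_function returns signed distances to the decision boundary rather than probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# toy binary problem standing in for our real X / Y_cat
X_toy, y_toy = make_classification(n_samples=100, n_features=20, random_state=0)

clf = SVC(kernel='rbf')   # default kernel; 'linear', 'poly', 'sigmoid' also available
clf.fit(X_toy, y_toy)

print(clf.n_support_)                     # support vectors per class (a subset of training points)
print(clf.decision_function(X_toy[:2]))   # signed distances, not probabilities
```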
before we can go further, we need to divide our input data X into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
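A side note: since our classes are imbalanced (122 children vs. 33 adults), it can be safer to stratify the split so that both sets keep that ratio. A sketch with synthetic stand-ins for X and Y_cat (for the workshop we stick with the simple split above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-ins mimicking our 122 child / 33 adult label distribution
X_toy = np.zeros((155, 5))
y_toy = np.array(['child'] * 122 + ['adult'] * 33)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)
# the adult fraction in the test set stays close to 33/155 ≈ 0.21
print(round((y_te == 'adult').mean(), 2))
```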
and can already fit our analysis pipeline:
pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
followed by testing the model’s predictive performance:
print('accuracy is %s with chance level being %s' %(accuracy_score(pipe.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy is 0.8974358974358975 with chance level being 0.5
(spoiler alert: can this be right?)
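One reason to be suspicious: with 122 children and 33 adults, 1/n_classes = 0.5 is not a meaningful chance level, because always guessing the majority class already scores ~0.79. A quick sketch with synthetic labels mirroring those counts (the features are dummies, only the label distribution matters):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X_toy = np.zeros((155, 5))                          # features are irrelevant here
y_toy = np.array(['child'] * 122 + ['adult'] * 33)  # mirrors our class counts

dummy = DummyClassifier(strategy='most_frequent')   # always predicts 'child'
dummy.fit(X_toy, y_toy)
print(round(dummy.score(X_toy, y_toy), 2))          # 0.79, i.e. 122/155
```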
Supervised learning - regression
after seeing that we can obtain a super high accuracy using a classification approach, we’re hooked and want to check if we can get even higher scores by addressing our learning problem via a regression approach
for this to work, we need to change our labels, i.e. Y, from a categorical to a continuous variable:
information.head(n=5)
| | participant_id | Age | AgeGroup | Child_Adult | Gender | Handedness |
|---|---|---|---|---|---|---|
| 0 | sub-pixar123 | 27.06 | Adult | adult | F | R |
| 1 | sub-pixar124 | 33.44 | Adult | adult | M | R |
| 2 | sub-pixar125 | 31.00 | Adult | adult | M | R |
| 3 | sub-pixar126 | 19.00 | Adult | adult | F | R |
| 4 | sub-pixar127 | 23.00 | Adult | adult | F | R |
here Age seems like a good fit:
Y_con = information['Age']
Y_con.describe()
count 155.000000
mean 10.555189
std 8.071957
min 3.518138
25% 5.300000
50% 7.680000
75% 10.975000
max 39.000000
Name: Age, dtype: float64
however, we are of course going to plot it again (reminder: always check your data):
fig = px.histogram(Y_con, marginal='box', template='plotly_white')
fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
the only thing we need to do to change our previous analysis pipeline from a classification to a regression task is to adapt the estimator accordingly:
from sklearn.linear_model import LinearRegression
pipe = make_pipeline(
    StandardScaler(),
    LinearRegression()
)
A bit of information about regression
modelling the relationship between a scalar response and one or more explanatory variables
Pros
simple implementation, efficient & fast
good performance when the relationship is (close to) linear
can address overfitting via regularization
Cons
prone to underfitting
outlier sensitivity
assumption of independence
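The regression pipeline above is defined but not yet fitted; the remaining steps mirror the classification case (split, fit, score). A self-contained sketch with synthetic data standing in for X and Age (the target here is constructed, so the score only illustrates the workflow, not any real result):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(155, 50))
# synthetic 'age': linear dependence on the first feature plus noise
y_toy = 10 + 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=155)

pipe = make_pipeline(StandardScaler(), LinearRegression())
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
pipe.fit(X_tr, y_tr)
print(round(r2_score(y_te, pipe.predict(X_te)), 2))  # high R² on this easy toy problem
```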